
Statistics in Medicine

Wiley

All preprints, ranked by how well they match Statistics in Medicine's content profile, based on 34 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.

1
Highly adaptive LASSO: Machine learning that provides valid nonparametric inference in realistic models

Butzin-Dozier, Z.; Qiu, S.; Hubbard, A. E.; Shi, J.; van der Laan, M.

2024-10-19 epidemiology 10.1101/2024.10.18.24315778 medRxiv
Top 0.1%
52.4%

Understanding treatment effects on health-related outcomes using real-world data requires defining a causal parameter and imposing relevant identification assumptions to translate it into a statistical estimand. Semiparametric methods, like the targeted maximum likelihood estimator (TMLE), have been developed to construct asymptotically linear estimators of these parameters. To further establish the asymptotic efficiency of these estimators, two conditions must be met: 1) the relevant components of the data likelihood must fall within a Donsker class, and 2) the estimates of nuisance parameters must converge to their true values at a rate faster than n^(-1/4). The Highly Adaptive LASSO (HAL) satisfies these criteria by acting as an empirical risk minimizer within a class of càdlàg functions with a bounded sectional variation norm, which is known to be Donsker. HAL achieves the desired rate of convergence, thereby guaranteeing the estimator's asymptotic efficiency. The function class over which HAL minimizes its risk is flexible enough to capture realistic functions while maintaining the conditions for establishing efficiency. Additionally, HAL enables robust inference for non-pathwise differentiable parameters, such as the conditional average treatment effect (CATE) and causal dose-response curve, which are important in precision health. While these parameters are often considered in the machine learning literature, these applications typically lack proper statistical inference. HAL addresses this gap by providing reliable statistical uncertainty quantification that is essential for informed decision-making in health research.
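As a rough sketch of the efficiency argument this abstract summarizes (the notation below is assumed, not taken from the preprint), the two conditions control the two remainder terms in the standard asymptotic-linearity expansion of a TMLE:

```latex
% P_n: empirical measure; P_0: true distribution; D^*: efficient influence function;
% \hat{Q}: HAL estimate of the nuisance components; R_2: second-order remainder.
\hat{\psi} - \psi_0
  \;=\; P_n D^*(Q_0)
  \;+\; (P_n - P_0)\bigl[D^*(\hat{Q}) - D^*(Q_0)\bigr]
  \;+\; R_2(\hat{Q}, Q_0)
% Condition 1 (Donsker class): the empirical-process term is o_P(n^{-1/2}).
% Condition 2 (rate): R_2(\hat{Q}, Q_0) \lesssim \lVert \hat{Q} - Q_0 \rVert^2 = o_P(n^{-1/2})
%                     whenever \lVert \hat{Q} - Q_0 \rVert = o_P(n^{-1/4}).
```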

2
Jointly modeling prevalence, sensitivity and specificity for optimal sample allocation

Larremore, D. B.; Fosdick, B. K.; Zhang, S.; Grad, Y. H.

2020-05-26 immunology 10.1101/2020.05.23.112649 medRxiv
Top 0.1%
47.5%

The design and interpretation of prevalence studies rely on point estimates of the performance characteristics of the diagnostic test used. When the test characteristics are not well defined and a limited number of tests are available, such as during an outbreak of a novel pathogen, tests can be used either for the field study itself or for additional validation to reduce uncertainty in the test characteristics. Because field data and validation data are based on finite samples, inferences drawn from these data carry uncertainty. In the absence of a framework to balance those uncertainties during study design, it is unclear how best to distribute tests to improve study estimates. Here, we address this gap by introducing a joint Bayesian model to simultaneously analyze lab validation and field survey data. In many scenarios, prevalence estimates can be most improved by apportioning additional effort towards validation rather than to the field. We show that a joint model provides superior estimation of prevalence, as well as sensitivity and specificity, compared with typical analyses that model lab and field data separately.
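A minimal sketch (not the authors' code; all counts and priors below are made-up illustrations) of how field survey data and lab validation data can be combined in a single posterior for prevalence, sensitivity, and specificity:

```python
# A grid-based joint posterior for (prevalence, sensitivity, specificity).
import numpy as np
from scipy import stats

# Field survey: X positives out of N tests in the target population.
N, X = 1000, 80
# Lab validation: the same test applied to known positives and known negatives.
n_pos, k_pos = 100, 93    # 93/100 true positives test positive  -> sensitivity
n_neg, k_neg = 200, 196   # 196/200 true negatives test negative -> specificity

grid = np.linspace(0.005, 0.995, 100)                 # uniform priors on a grid
theta, se, sp = np.meshgrid(grid, grid, grid, indexing="ij")

p_pos = theta * se + (1 - theta) * (1 - sp)           # P(test positive) in the field
log_post = (stats.binom.logpmf(X, N, p_pos)           # field survey likelihood
            + stats.binom.logpmf(k_pos, n_pos, se)    # validation of sensitivity
            + stats.binom.logpmf(k_neg, n_neg, sp))   # validation of specificity
post = np.exp(log_post - log_post.max())
post /= post.sum()

marg_theta = post.sum(axis=(1, 2))                    # marginal posterior of prevalence
cdf = np.cumsum(marg_theta)
lo, hi = grid[np.searchsorted(cdf, [0.025, 0.975])]
print(f"posterior mean prevalence = {np.sum(grid * marg_theta):.3f}, "
      f"95% CrI = ({lo:.3f}, {hi:.3f})")
```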

3
Evaluation Methods for T-association of a Surrogate Endpoint

Hung, J.-Y.; Hsu, C.-Y.; Su, P.-F.; Shyr, Y.

2025-08-29 health policy 10.1101/2025.08.28.25334653 medRxiv
Top 0.1%
34.0%

A surrogate endpoint is a biomarker that is reasonably likely to predict clinical benefit and is used as a substitute for a direct measure of clinical benefit under the Food and Drug Administration (FDA) Accelerated Approval pathway. According to FDA guidelines, a valid surrogate endpoint must meet two associations: I-association (the association between the surrogate and true endpoints, such as disease response and overall survival) and T-association (the association between treatment effects on both endpoints, such as odds ratio and hazard ratio). I-association is commonly evaluated, but T-association is often overlooked due to the lack of appropriate statistical methods. Failure to satisfy T-association precludes a biomarker from supporting accelerated approval. To address this gap, we propose a new method to rigorously assess T-association in accordance with FDA guidelines. This method assumes that treatment effects on the surrogate and true endpoints follow a bivariate normal distribution, accounting for both within-study and between-study variances. The key evaluation metric is the correlation coefficient, which quantifies the relationship between treatment effects on both endpoints. Model parameters, including this correlation, are estimated using maximum likelihood, restricted maximum likelihood, and a Bayesian approach. We demonstrate the method using both simulated and real-world data. The method will serve as the statistical foundation that aligns with FDA guidelines and supports future accelerated approvals. The R package to implement the proposed method is available at https://github.com/jybelindahung/T-association.
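A sketch of the kind of bivariate random-effects model the abstract describes (our notation; the preprint's exact parametrization may differ). For study i, let alpha_i and beta_i denote the true treatment effects on the surrogate and true endpoints, with estimates carrying a known within-study covariance S_i:

```latex
% Within-study level: sampling error of the estimated effects, with known covariance S_i.
\begin{pmatrix} \hat{\alpha}_i \\ \hat{\beta}_i \end{pmatrix} \Big|\, (\alpha_i, \beta_i)
  \sim N_2\!\left( \begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix},\; S_i \right),
\qquad
% Between-study level: true effects vary across studies.
\begin{pmatrix} \alpha_i \\ \beta_i \end{pmatrix}
  \sim N_2\!\left( \begin{pmatrix} \alpha \\ \beta \end{pmatrix},\;
  \begin{pmatrix} \tau_\alpha^2 & \rho\,\tau_\alpha \tau_\beta \\
                  \rho\,\tau_\alpha \tau_\beta & \tau_\beta^2 \end{pmatrix} \right)
% \rho is the correlation between treatment effects on the two endpoints,
% i.e., the T-association metric referred to in the abstract.
```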

4
A simulation-based procedure to estimate base rates from Covid-19 antibody test results I: Deterministic test reliabilities

Joosten, R.; Abhishta, A.

2020-05-04 infectious diseases 10.1101/2020.04.28.20075036 medRxiv
Top 0.1%
33.0%

We design a procedure (the complete Python code may be obtained at https://github.com/abhishta91/antibody_montecarlo) using Monte Carlo (MC) simulation to establish the point estimators described below and confidence intervals for the base rate of occurrence of an attribute (e.g., antibodies against Covid-19) in an aggregate population (e.g., medical care workers) based on a test. The requirements for the procedure are the test's sample size (N) and total number of positives (X), and data on the test's reliability. The modus is the prior which generates the largest frequency of observations in the MC simulation with precisely the number of test positives (maximum-likelihood estimator). The median is the upper bound of the set of priors accounting for half of the total relevant observations in the MC simulation with numbers of positives identical to the test's number of positives. Our rather preliminary findings are:
- The median and the confidence intervals suffice universally.
- The estimator [Formula] may be outside of the two-sided 95% confidence interval.
- Conditions such that the modus, the median, and another promising estimator, which takes the reliability of the test into account, are quite close.
- Conditions such that the modus and the latter estimator must be regarded as logically inconsistent.
- Conditions inducing rankings among various estimators relevant for issues concerning over- or underestimation.

JEL codes: C11, C13, C63
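A minimal sketch (not the repository's code; the test characteristics, N, and X are made-up values) of the Monte Carlo procedure described above: each candidate base rate is scored by how often it reproduces exactly X positive tests.

```python
# Monte Carlo scoring of candidate base rates under deterministic test reliabilities.
import numpy as np

rng = np.random.default_rng(0)
N, X = 500, 40            # sample size and observed number of positive tests
sens, spec = 0.90, 0.95   # assumed deterministic sensitivity and specificity
priors = np.linspace(0.0, 0.5, 251)   # candidate base rates ("priors")
reps = 20_000                          # MC replications per candidate

counts = np.empty_like(priors)
for j, p in enumerate(priors):
    p_test_pos = p * sens + (1 - p) * (1 - spec)   # P(positive test) under base rate p
    sims = rng.binomial(N, p_test_pos, size=reps)  # simulated numbers of positives
    counts[j] = np.sum(sims == X)                  # how often the observed X is reproduced

weights = counts / counts.sum()
modus = priors[np.argmax(counts)]                  # maximum-likelihood-type estimator
cdf = np.cumsum(weights)
median = priors[np.searchsorted(cdf, 0.5)]
ci_lo, ci_hi = priors[np.searchsorted(cdf, [0.025, 0.975])]
print(f"modus={modus:.3f}, median={median:.3f}, 95% CI=({ci_lo:.3f}, {ci_hi:.3f})")
```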

5
A Compound Model of Multiple Treatment Selection with Applications to Marginal Structural Modeling

Stein, D. W.; Gaspar, F.; Piantadosi, S.; Amin, A.; Webb, B.; Lu, D.; D'Arinzo, L.; Oliver, M.; Fitzgerald, K.

2023-02-10 epidemiology 10.1101/2023.02.08.23285425 medRxiv
Top 0.1%
24.4%

Methods of causal inference are used to estimate treatment effectiveness for non-randomized study designs. The propensity score (i.e., the probability that a subject receives the study treatment conditioned on a set of variables related to treatment and/or outcome) is often used with matching or sample weighting techniques to, ideally, eliminate bias in the estimates of treatment effect due to treatment decisions. If multiple treatments are available, the propensity score is a function of the adjustment set and the set of possible treatments. This paper develops a compound model that separates the treatment decision into a binary decision: treat or don't treat; and a potential treatment decision: choose the treatment that would be given if the subject is treated. It is applicable if the treatment set is finite, treatments are given at one time point, and the outcome is observed at a fixed time point. This representation can reduce bias when not all treatments are available to all patients. Multiple-treatment stabilized marginal structural weights were calculated with this approach, and the method was applied to an observational study to evaluate the effectiveness of different neutralizing monoclonal antibodies to treat infection with various severe acute respiratory syndrome coronavirus 2 variants.
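A minimal sketch (not the authors' code; the data are simulated and the variable names are ours) of the compound factorization and the resulting stabilized weights: P(A = a | X) = P(treated | X) x P(choice = a | treated, X).

```python
# Compound treatment-selection model and stabilized MSM weights on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))                                   # adjustment set
treated = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))         # treat vs. don't treat
choice = np.where(treated == 1,
                  rng.integers(0, 3, size=n),                 # which of 3 treatments, if treated
                  -1)                                          # -1 = untreated
A = choice                                                     # observed treatment (-1, 0, 1, 2)

# Stage 1: probability of being treated at all, given X.
m_treat = LogisticRegression().fit(X, treated)
p_treat = m_treat.predict_proba(X)[:, 1]

# Stage 2: probability of each potential treatment, given X, among the treated.
m_choice = LogisticRegression(max_iter=1000).fit(X[treated == 1], choice[treated == 1])
p_choice_all = m_choice.predict_proba(X)                       # columns follow m_choice.classes_

# Denominator of the weights: P(A_i = a_i | X_i) under the compound model.
cols = np.searchsorted(m_choice.classes_, np.maximum(A, 0))
p_obs = np.where(A == -1, 1 - p_treat,
                 p_treat * p_choice_all[np.arange(n), cols])

# Stabilized weights: marginal treatment probabilities over conditional ones.
levels = np.unique(A)
marg = np.array([np.mean(A == a) for a in levels])
w = marg[np.searchsorted(levels, A)] / p_obs
print("weight summary:", w.min().round(3), w.mean().round(3), w.max().round(3))
```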

6
A latent outcome variable approach for Mendelian randomization using the expectation maximization algorithm

Amente, L. D.; Mills, N. T.; Le, T. D.; Hypponen, E.; Lee, S. H.

2024-08-26 epidemiology 10.1101/2024.08.24.24312485 medRxiv
Top 0.1%
22.6%

Mendelian randomization (MR) is a widely used tool to uncover causal relationships between exposures and outcomes. However, existing MR methods can suffer from inflated type I error rates and biased causal effects in the presence of invalid instruments. Our proposed method enhances MR analysis by augmenting latent phenotypes of the outcome, explicitly disentangling horizontal and vertical pleiotropy effects. This allows for explicit assessment of the exclusion restriction assumption and iteratively refines causal estimates through the expectation-maximization algorithm. This approach offers a unique and potentially more precise framework compared to existing MR methods. We rigorously evaluate our method against established MR approaches across diverse simulation scenarios, including balanced and directional pleiotropy, as well as violations of the Instrument Strength Independent of Direct Effect (InSIDE) assumption. Our findings consistently demonstrate superior performance of our method in terms of controlling type I error rates, bias, and robustness to genetic confounding. Additionally, our method facilitates testing for directional horizontal pleiotropy and outperforms MR-Egger in this regard, while also effectively testing for violations of the InSIDE assumption. We apply our method to real data, demonstrating its effectiveness compared to traditional MR methods. This analysis reveals the causal effects of body mass index (BMI) on metabolic syndrome (MetS) and a composite MetS score calculated by the weighted sum of its component factors. While the causal relationship is consistent across most methods, our proposed method shows fewer violations of the exclusion restriction assumption, especially for MetS scores where horizontal pleiotropy persists and other methods suffer from inflation.

7
An efficient distributed algorithm with application to COVID-19 data from heterogeneous clinical sites

Tong, J.; Luo, C.; Islam, M. N.; Sheils, N.; Buresh, J.; Edmondson, M.; Merkel, P. A.; Lautenbach, E.; Duan, R.; Chen, Y.

2020-11-18 epidemiology 10.1101/2020.11.17.20220681 medRxiv
Top 0.1%
22.3%

Objectives: Integrating electronic health records (EHR) data from several clinical sites offers great opportunities to improve estimation with a more general population compared to analyses based on a single clinical site. However, sharing patient-level data across sites is practically challenging due to concerns about maintaining patient privacy. The objective of this study is to develop a novel distributed algorithm to integrate heterogeneous EHR data from multiple clinical sites without sharing patient-level data. Materials and Methods: The proposed distributed algorithm for binary regression can effectively account for between-site heterogeneity and is communication-efficient. Our method is built on a pairwise likelihood function in the extended Mantel-Haenszel regression, which is known to be statistically highly efficient. We construct a surrogate pairwise likelihood function that approximates the target pairwise likelihood. We show that the proposed surrogate pairwise likelihood leads to a consistent and asymptotically normal estimator through effective communication without sharing individual patient-level data. We study the empirical performance of the proposed method through a systematic simulation study and an application to data on 14,215 COVID-19 patients from 230 clinical sites in the UnitedHealth Group Clinical Research Database. Results: The proposed method was shown to perform close to the gold standard approach under extensive simulation settings. When the event rate is <5%, the relative bias of the proposed estimator is 30% smaller than that of the meta-analysis estimator. The proposed method retained high accuracy across different sample sizes and event rates compared with meta-analysis. In the data evaluation, the proposed estimate has a relative bias <9% when the event rate is <1%, whereas the meta-analysis estimate has a relative bias at least 10% higher than that of the proposed method. Conclusions: Our simulation study and data application demonstrate that the proposed distributed algorithm provides an estimator that is robust to heterogeneity in event rates when effectively integrating data from multiple clinical sites. Our algorithm is therefore an effective alternative to both meta-analysis and existing distributed algorithms for modeling heterogeneous multi-site binary outcomes.
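As a rough sketch of the surrogate-likelihood idea behind such communication-efficient algorithms (this is the generic first-order construction from the distributed-inference literature, written in our notation; the preprint's pairwise-likelihood version may differ in detail), with l_k the log pairwise likelihood at site k and beta-bar an initial estimate, the lead site (k = 1) maximizes:

```latex
% Generic surrogate log-likelihood built at the lead site (assumed form, not the paper's exact one).
\tilde{\ell}(\beta)
  \;=\; \ell_1(\beta)
  \;+\; \Bigl( \tfrac{1}{K}\sum_{k=1}^{K} \nabla \ell_k(\bar{\beta})
  \;-\; \nabla \ell_1(\bar{\beta}) \Bigr)^{\!\top} \beta
% Each site only communicates the gradient \nabla \ell_k(\bar{\beta}) evaluated at \bar{\beta},
% never patient-level data.
```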

8
On Multiply Robust Mendelian Randomization (MR2) With Many Invalid Genetic Instruments

Sun, B.; Liu, Z.; Tchetgen Tchetgen, E.

2021-10-26 epidemiology 10.1101/2021.10.21.21265317 medRxiv
Top 0.1%
22.2%

Mendelian randomization (MR) is a popular instrumental variable (IV) approach, in which genetic markers are used as IVs. In order to improve efficiency, multiple markers are routinely used in MR analyses, leading to concerns about bias due to possible violation of the IV exclusion restriction of no direct effect of any IV on the outcome other than through the exposure in view. To address this concern, we introduce a new class of Multiply Robust MR (MR2) estimators that are guaranteed to remain consistent for the causal effect of interest provided that at least one genetic marker is a valid IV, without necessarily knowing which IVs are invalid. We show that the proposed MR2 estimators are a special case of a more general class of estimators that remain consistent provided that a set of at least k† out of K candidate instrumental variables are valid, for k† ≤ K set by the analyst ex ante, without necessarily knowing which IVs are invalid. We provide formal semiparametric theory supporting our results, and characterize the semiparametric efficiency bound for the exposure causal effect, which cannot be improved upon by any regular estimator with our favorable robustness property. We conduct extensive simulation studies and apply our methods to a large-scale analysis of UK Biobank data, demonstrating the superior empirical performance of MR2 compared to competing MR methods.

9
Bayesian Variable Selection Utilizing Posterior Probability Credible Intervals

Du, M.; Andersen, S. L.; Perls, T. T.; Sebastiani, P.

2021-01-15 epidemiology 10.1101/2021.01.13.21249759 medRxiv
Top 0.1%
19.5%

In recent years, there has been growing interest in the problem of model selection in the Bayesian framework. Current approaches include methods based on computing model probabilities, such as Stochastic Search Variable Selection (SSVS) and the Bayesian LASSO, and methods based on model choice criteria, such as the Deviance Information Criterion (DIC). Methods in the first group compute the posterior probabilities of models or model parameters, often using a Markov Chain Monte Carlo (MCMC) technique, and select a subset of the variables based on a prespecified threshold on the posterior probability. However, these methods rely heavily on the prior choices of parameters, and the results can be highly sensitive when priors are changed. DIC is a Bayesian generalization of Akaike's Information Criterion (AIC) that penalizes a large number of parameters; it has the advantage that it can be used for selection of mixed-effects models, but it tends to prefer overparameterized models. We propose a novel variable selection algorithm that utilizes the parameters' credible intervals to select the variables to be kept in the model. We show in a simulation study and a real-world example that this algorithm on average performs better than DIC and produces more parsimonious models.
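A minimal sketch (not the authors' algorithm; priors, data, and the 95% threshold are illustrative assumptions) of selection via posterior credible intervals: draw from a conjugate Bayesian linear regression posterior and keep coefficients whose interval excludes zero.

```python
# Variable selection by posterior credible intervals in a conjugate linear model.
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 8
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -1.0, 0.8, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.normal(size=n)

# Normal-inverse-gamma posterior with a ridge-type prior beta | sigma^2 ~ N(0, sigma^2/lam I).
lam, a0, b0 = 1.0, 1.0, 1.0
prec = X.T @ X + lam * np.eye(p)
V = np.linalg.inv(prec)
m = V @ X.T @ y
a_n = a0 + n / 2
b_n = b0 + 0.5 * (y @ y - m @ prec @ m)

# Posterior draws of (sigma^2, beta).
draws = 4000
sigma2 = 1 / rng.gamma(a_n, 1 / b_n, size=draws)
beta = np.array([rng.multivariate_normal(m, s2 * V) for s2 in sigma2])

# Keep variables whose 95% credible interval excludes zero.
lo, hi = np.percentile(beta, [2.5, 97.5], axis=0)
selected = np.where((lo > 0) | (hi < 0))[0]
print("selected variables:", selected)
```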

10
Joint modeling of survival and backwards recurrence outcomes: an analysis of factors associated with fertility treatment

Guo, S.; Zhang, J.; McLain, A. C.

2022-02-25 epidemiology 10.1101/2022.02.24.22271471 medRxiv
Top 0.1%
19.1%

The increase in methods focused on various types of survival outcomes has allowed practitioners to analyze data that are difficult or expensive to prospectively observe. Still, there are populations that are challenging to study. For example, obtaining a representative sample of couples attempting to become pregnant is difficult due to the dynamic nature of the population. This has led to an increase in the use of cross-sectional designs yielding backwards recurrent survival outcomes. In this paper, we consider the analysis of a survival outcome where subjects are observed if they are at risk for a separate dependent survival outcome. The motivation for this problem is to determine which factors are associated with time-to-fertility-treatment (TTFT) among women currently attempting pregnancy in a cross-sectional sample. We propose appending a marginal accelerated failure time (AFT) model on TTFT with a conditional model on time-to-pregnancy (TTP) given TTFT to account for their dependence and avoid biases. We address challenges that arise due to the censoring of TTFT and the resulting increased computational complexity. The performance is validated via comprehensive simulation studies. We apply our approach to data from the National Survey of Family Growth to estimate the association of insurance type with TTFT, and to estimate the impact of fertility treatment on TTP.

11
Assessment of non-linear mixed effects model-based approaches to test for drug effect using simulated data: type I error and power properties

Chasseloup, E.; Tessier, A.; Karlsson, M. O.

2024-04-17 pharmacology and toxicology 10.1101/2024.04.13.589388 medRxiv
Top 0.1%
18.5%

Pharmacometric approaches achieve higher power to detect a drug effect compared to traditional statistical hypothesis tests. Known drawbacks come from the model building process, where multiple testing and model misspecification are major causes of type I error inflation. IMA is a new approach using mixture models and the likelihood ratio test (LRT) to test for drug effect. It previously showed type I error control and unbiased drug effect estimates in the context of two-arm balanced designs using real placebo data, in comparison to the standard approach (STD). The aim of this study was to extend the assessment of IMA and STD regarding type I error, power, and bias in the drug effect estimates under various types of model misspecification, with or without LRT calibration. Two classical statistical approaches, the t-test and the Mixed-Effect Model Repeated Measure (MMRM), were also added to the comparison. The focus was a simulation study where the extent of the model misspecification is known, using a response model with or without drug effect as the motivating example in two sample size scenarios. The IMA performance was overall not impacted by the sample size or the LRT calibration, contrary to STD, which had better type I error results with the larger sample size and calibrated LRT. In terms of power, STD required LRT calibration to outperform IMA. The t-test and MMRM both had controlled type I error. The t-test had lower power than both STD and IMA, while MMRM had power predictions similar to IMA. IMA and STD had similarly unbiased drug effect estimates, with few exceptions. IMA again showed encouraging performance (type I error control and unbiased drug estimates) and presented reasonable power predictions. The IMA performance was overall more robust towards model misspecification compared to STD. IMA confirmed its status as a promising NLMEM-based approach for hypothesis testing of the drug effect and could be used in the future, after further evaluations, as the primary analysis in confirmatory trials.

12
Optimising a coordinate ascent algorithm for the meta-analysis of test accuracy studies

Baragilly, M. H.; Willis, B. H.

2022-12-08 bioinformatics 10.1101/2022.12.05.519131 medRxiv
Top 0.1%
17.7%

Meta-analysis may be used to summarise a test's accuracy. Often the sensitivity and specificity are the measures of interest, and as these are correlated, a bivariate random effects model is commonly used to fit the data. This model has five parameters and may be optimised using a Newton-Raphson-based algorithm, provided adequate initial values of the parameters are identified. Numerical methods may be used to estimate robust initial values, but estimating these is computationally expensive and it is not clear whether they provide a significant advantage over closed form methods in terms of reducing bias, mean square error, average relative error, and coverage probability. Here we consider six closed form methods for estimating the initial values of the parameters for a co-ordinate ascent algorithm used to fit the bivariate model and compare them with numerically derived robust initial values. Using simulation studies we demonstrate that all the closed form methods lead to a reduction in computation time of around 80% and rank higher overall across the metrics when compared with the robust initial values method. Although no initial values estimator dominated the others across all parameters and metrics, the two-step Hedges-Olkin estimator ranked highest overall across the different scenarios.

13
Multiple imputation assuming missing at random: auxiliary imputation variables that only predict missingness can increase bias due to data missing not at random

Curnow, E.; Cornish, R. P.; Heron, J.; Carpenter, J. R.; Tilling, K.

2023-10-17 epidemiology 10.1101/2023.10.17.23297137 medRxiv
Top 0.1%
17.6%

Epidemiological studies often have missing data, which are commonly handled by multiple imputation (MI). MI is valid (given correctly-specified models) if data are missing at random, conditional on the observed data, but not (unless additional information is available) if data are missing not at random (MNAR). In this paper we explore a previously-suggested strategy, namely, including an auxiliary variable predictive of missingness but not the missing data in the imputation model, when data are MNAR. We quantify, algebraically and by simulation, the magnitude of additional bias of the MI estimator, over and above any bias due to data MNAR, from including such an auxiliary variable. We demonstrate that where missingness is caused by the outcome, additional bias can be substantial when the outcome is partially observed. Furthermore, if missingness is caused by the outcome and the exposure, additional bias can be even larger, when either the outcome or exposure is partially observed. When using MI, it is important to identify, through a combination of data exploration and considering plausible causal diagrams and missingness mechanisms, the auxiliary variables most predictive of the missing data (in addition to all variables required for the analysis model and/or to minimise bias due to MNAR).

14
Treatment group outcome variance difference after dropout as an indicator of missing-not-at-random bias in randomized clinical trials

Hazewinkel, A.-D.; Tilling, K.; Wade, K. H.; Palmer, T. M.

2022-04-18 epidemiology 10.1101/2022.04.15.22273918 medRxiv
Top 0.1%
15.2%

Randomized controlled trials (RCTs) are considered the gold standard for assessing the causal effect of an exposure on an outcome, but are vulnerable to bias from missing data. When outcomes are missing not at random (MNAR), estimates from complete case analysis (CCA) will be biased. There is no statistical test for distinguishing between outcomes missing at random (MAR) and MNAR, and current strategies rely on comparing dropout proportions and covariate distributions, and using auxiliary information to assess the likelihood of dropout being associated with the outcome. We propose using the observed variance difference across treatment groups as a tool for assessing the risk of dropout being MNAR. In an RCT, at randomization, the distributions of all covariates should be equal in the populations randomized to the intervention and control arms. Under the assumption of homogeneous treatment effects, the variance of the outcome will also be equal in the two populations over the course of follow-up. We show that under MAR dropout, the observed outcome variances, conditional on the variables included in the model, are equal across groups, while MNAR dropout may result in unequal variances. Consequently, unequal observed conditional group variances are an indicator of MNAR dropout and possible bias of the estimated treatment effect. Heterogeneity of treatment effect affects the intervention group variance, and is another potential cause of observing different outcome variances. We show that, for longitudinal data, we can isolate the effect of MNAR outcome-dependent dropout by considering the variance difference at baseline in the same set of patients that are observed at final follow-up. We illustrate our method in simulation and in applications using individual-level patient data and summary data.
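A minimal simulation sketch (not the authors' code; all parameters are made up) of the key point: with a homogeneous treatment effect, MAR dropout leaves the observed outcome variances equal across arms, while outcome-dependent (MNAR) dropout can make them unequal.

```python
# Observed outcome variances across arms under MAR vs. outcome-dependent MNAR dropout.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
treat = rng.binomial(1, 0.5, size=n)
y = 1.0 * treat + rng.normal(size=n)          # homogeneous treatment effect

# MAR dropout: missingness depends only on treatment arm (observed), not on y.
keep_mar = rng.binomial(1, np.where(treat == 1, 0.7, 0.9)).astype(bool)
# MNAR dropout in the treated arm: patients with high outcomes drop out more often.
p_keep = np.where(treat == 1, 1 / (1 + np.exp(y - 1.0)), 0.9)
keep_mnar = rng.binomial(1, p_keep).astype(bool)

for label, keep in [("MAR", keep_mar), ("MNAR", keep_mnar)]:
    v1 = y[keep & (treat == 1)].var()
    v0 = y[keep & (treat == 0)].var()
    print(f"{label}: var(treated)={v1:.3f}, var(control)={v0:.3f}, difference={v1 - v0:+.3f}")
```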

15
Bridge Category Models: Development of Bayesian Modelling Procedures to Account for Bridge Ordinal Ratings for Disease Staging

Levy, J.; Bobak, C. A.; Azizgolshani, N.; Andersen, M. J.; Suriawinata, A.; Liu, X.; Lisovsky, M.; Ren, B.; Christensen, B.; Vaickus, L.; O'Malley, J.

2021-08-18 pathology 10.1101/2021.08.17.456726 medRxiv
Top 0.1%
14.9%

Disease grading and staging is accomplished through the assignment of an ordinal rating. Bridge ratings occur when a rater assigns two adjacent categories. Most statistical methodology necessitates the use of a single ordinal category. Consequently, bridge ratings often go unreported in clinical research studies. We propose three methodologies, the Expanded, Mixture, and Collapsed Bridge Category Models, to account for bridge ratings. We perform simulations to examine the impact of our approaches on detecting treatment effects, and comment on a real-world scenario of staging liver biopsies. Results indicate that if bridge ratings are not accounted for, disease staging models may exhibit significant bias and precision loss. All models worked well when they corresponded to the data generating mechanism.

16
Accounting for Twins and Other Multiple Births in Perinatal Studies Conducted Using Healthcare Administration Data

Brown, J. P.; Yland, J. J.; Williams, P. L.; Huybrechts, K. F.; Hernandez-Diaz, S.

2024-01-24 epidemiology 10.1101/2024.01.23.24301685 medRxiv
Top 0.1%
14.6%

The analysis of perinatal studies is complicated by twins and other multiple births even when they are not the exposure, outcome, or a confounder of interest. Common approaches to handling multiples in studies of infant outcomes include restriction to singletons, counting outcomes at the pregnancy-level (i.e., by counting if at least one twin experienced a binary outcome), or infant-level analysis including all infants and, typically, accounting for clustering of outcomes by using generalised estimating equations or mixed effects models. Several healthcare administration databases only support restriction to singletons or pregnancy-level approaches. For example, in MarketScan insurance claims data, diagnoses in twins are often assigned to a single infant identifier, thereby preventing ascertainment of infant-level outcomes among multiples. Different approaches correspond to different causal questions, produce different estimands, and often rely on different assumptions. We demonstrate the differences that can arise from these different approaches using Monte Carlo simulations, algebraic formulas, and an applied example. Furthermore, we provide guidance on the handling of multiples in perinatal studies when using healthcare administration data.
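A minimal sketch (not from the paper; the data, variable names, and the exchangeable working correlation are assumptions) of an infant-level analysis that accounts for within-pregnancy clustering with GEE, alongside the pregnancy-level "at least one affected infant" aggregation:

```python
# Infant-level GEE with clustering by pregnancy, plus a pregnancy-level alternative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_preg = 2000
n_infants = rng.choice([1, 2], size=n_preg, p=[0.97, 0.03])   # singletons and twins
df = pd.DataFrame({
    "pregnancy_id": np.repeat(np.arange(n_preg), n_infants),
    "exposed": np.repeat(rng.binomial(1, 0.3, size=n_preg), n_infants),
})
df["outcome"] = rng.binomial(1, np.where(df["exposed"] == 1, 0.08, 0.05))

# Infant-level analysis: GEE with an exchangeable working correlation within pregnancies.
model = sm.GEE.from_formula("outcome ~ exposed", groups="pregnancy_id", data=df,
                            family=sm.families.Binomial(),
                            cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()
print(np.exp(result.params))   # odds ratios for intercept and exposure

# Pregnancy-level alternative: did at least one infant experience the outcome?
preg = df.groupby("pregnancy_id").agg(exposed=("exposed", "first"),
                                      any_outcome=("outcome", "max")).reset_index()
```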

17
Unjustified Poisson assumptions lead to overconfident estimates of the effective reproductive number

Nemcova, B.; Goldstein, I. H.; Sebastian, J.; Minin, V. M.; Bracher, J.

2025-07-31 public and global health 10.1101/2025.07.31.25332479 medRxiv
Top 0.1%
14.4%

Time-varying effective reproductive numbers of infectious diseases are commonly estimated using renewal equation models. In the widely applied R package EpiEstim and various related tools, this approach is combined with a Poisson distributional assumption. This has been criticized on various occasions, mostly on grounds of general model realism or a desire to estimate overdispersion parameters. Here we argue that an important issue arising from the Poisson assumption is that inference about the effective reproductive number becomes overconfident in the presence of overdispersion. The extent to which standard errors are underestimated follows in a straightforward manner from theory on generalized linear models. We therefore recommend replacing the Poisson assumption with quasi-Poisson or negative binomial extensions, and contrast their respective properties. We illustrate our arguments in detailed simulation studies and three case studies of Ebola, pandemic influenza, and COVID-19.
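A minimal sketch (not the authors' code; the incidence series and generation-interval weights are made up) contrasting Poisson and quasi-Poisson standard errors for a renewal-equation estimate of the reproductive number:

```python
# Renewal-equation fit: I_t ~ Poisson(R * Lambda_t), with and without a dispersion correction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
w = np.array([0.2, 0.4, 0.25, 0.15])                               # generation-interval weights
I = rng.negative_binomial(5, 5 / (5 + 30), size=60).astype(float)  # overdispersed case counts

# Total infectiousness Lambda_t = sum_s w_s * I_{t-s}.
Lam = np.array([np.sum(w * I[t - len(w):t][::-1]) for t in range(len(w), len(I))])
y = I[len(w):]

# Intercept-only Poisson GLM with offset log(Lambda): exp(intercept) estimates R.
Xd = np.ones((len(y), 1))
poisson_fit = sm.GLM(y, Xd, family=sm.families.Poisson(), offset=np.log(Lam)).fit()
quasi_fit = sm.GLM(y, Xd, family=sm.families.Poisson(), offset=np.log(Lam)).fit(scale="X2")

print("R estimate:", np.exp(poisson_fit.params[0]).round(2))
print("Poisson SE (log scale):      ", poisson_fit.bse[0].round(3))
print("quasi-Poisson SE (log scale):", quasi_fit.bse[0].round(3))
```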

18
Estimating Direct and Spillover Vaccine Effectiveness with Partial Interference under Test-Negative Design Sampling

Jiang, C.; Fang, F.; Talbot, D.; Schnitzer, M.

2025-02-25 infectious diseases 10.1101/2025.02.24.25322826 medRxiv
Top 0.1%
14.4%

The Test-Negative Design (TND), which involves recruiting care-seeking individuals who meet predefined clinical case criteria, offers valid statistical inference for Vaccine Effectiveness (VE) using data collected through passive surveillance, making it cost-efficient and timely. Infectious disease epidemiology often involves interference, where the treatment and/or outcome of one individual can affect the outcomes of others, rendering standard causal estimands ill-defined; ignoring such interference can bias VE evaluation and lead to ineffective vaccination policies. This article addresses the estimation of causal estimands for VE in the presence of partial interference using TND samples. Partial interference means that the vaccination of units within the same group/cluster may influence the outcomes of other members of the cluster. We define the population direct, spillover, total, and overall effects using the geometric risk ratio, which are identifiable under TND sampling. We investigate various stochastic policies for vaccine allocation in a counterfactual scenario, and identify policy-relevant VE causal estimands. We propose inverse-probability weighted (IPW) estimators for estimating the policy-relevant VE causal estimands with partial interference under the TND, and explore the statistical properties of these estimators.

19
Quantifying bias from dependent left truncation in survival analyses of real world data

Sondhi, A.; Humblet, O.; Swaminathan, A.

2021-08-05 epidemiology 10.1101/2021.08.02.21261492 medRxiv
Top 0.1%
14.3%

In real world data (RWD) studies, observed datasets are often subject to left truncation, which can bias estimates of survival parameters. Standard methods can only suitably account for left truncation when survival and entry time are independent. Therefore, in the dependent left truncation setting, it is important to quantify the magnitude and direction of estimator bias to determine whether an analysis provides valid results. We conduct simulation studies of common RWD analytic settings in order to determine when standard analysis provides reliable estimates, and to identify factors that contribute most to estimator bias. We also outline a procedure for conducting a simulation-based sensitivity analysis for an arbitrary dataset subject to dependent left truncation. Our simulation results show that when comparing a truncated real-world arm to a non-truncated arm, the estimated hazard ratio is biased upwards, providing conservative inference. The most important data-generating parameter contributing to bias is the proportion of left truncated patients, given any level of dependence between survival and entry time. For specific datasets and analyses that may differ from our example, we recommend applying our sensitivity analysis approach to determine how results would change given varying proportions of truncation.
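A minimal sketch (not the authors' procedure; all parameters are made up) of such a simulation-based check using the lifelines package: simulate dependent entry and survival times, truncate one arm, and compare the estimated hazard ratio against the truth.

```python
# Dependent left truncation of one arm and a Cox fit with delayed entry.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
n = 20_000
arm = rng.binomial(1, 0.5, size=n)                      # 1 = truncated real-world arm
frailty = rng.normal(size=n)
hazard = np.exp(0.3 * arm + 0.7 * frailty)              # true log HR for arm = 0.3
T = rng.exponential(1 / hazard)

# Entry times share the frailty with survival, so truncation is dependent;
# only the real-world arm is subject to delayed entry.
entry = np.where(arm == 1, rng.exponential(0.5 * np.exp(-0.7 * frailty)), 0.0)
kept = entry < T
print("proportion of real-world arm truncated:", round(1 - kept[arm == 1].mean(), 2))

df = pd.DataFrame({"T": T, "event": 1, "entry": entry, "arm": arm})[kept]
cph = CoxPHFitter().fit(df, duration_col="T", event_col="event", entry_col="entry")
print("estimated log HR:", round(cph.params_["arm"], 2), "(true value 0.3)")
```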

20
The Mathematics of Testing with Application to Prevalence of COVID-19

Hanin, L.

2020-06-02 epidemiology 10.1101/2020.05.31.20118703 medRxiv
Top 0.1%
14.1%

We formulate three basic assumptions that should ideally guide any well-designed COVID-19 prevalence study. We provide, on the basis of these assumptions alone, a full derivation of the mathematical formulas required for statistical analysis of testing data. In particular, we express the disease prevalence in a population through the prevalences in its homogeneous subpopulations. Although some of these formulas are routinely employed in prevalence studies, the study design often contravenes the assumptions upon which these formulas vitally depend. We also design a natural prevalence estimator from the testing data and study some of its properties. The results are equally valid for diseases other than COVID-19, as well as in non-epidemiological settings.
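For concreteness, the standard identities behind such derivations take the following form (our notation and the usual test-adjusted estimator; the paper's exact formulas and assumptions may differ):

```latex
% Prevalence of a population as a mixture of its homogeneous subpopulations
% with known weights w_k.
p \;=\; \sum_{k} w_k\, p_k, \qquad \sum_k w_k = 1
% With sensitivity (se) and specificity (sp), the probability of a positive test is
q \;=\; p\,\mathrm{se} + (1 - p)(1 - \mathrm{sp}),
% which inverts to the usual corrected prevalence estimator given X positives out of N tests:
\hat{p} \;=\; \frac{X/N - (1 - \mathrm{sp})}{\mathrm{se} + \mathrm{sp} - 1}
```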